
Bugfix; incorrect length was used to check for discarded areas #103

Merged
merged 3 commits into master from del_bugfix
Jan 2, 2024

Conversation


@pontus pontus commented Dec 21, 2023

This fixes a bug that could lead to incorrect data being returned in some cases.

When doing reads, we determine the largest amount of data we can read in one operation, so that we can check that area for holes caused by any edit lists. This is the minimum of the length of the slice we should store data in and the readable data left in the buffer.

The computation incorrectly used the number of bytes that had already been read rather than the number of bytes that could be read, so the wrong part of the logical decrypted stream was checked for holes.

Checking an area that is too large is not a problem, but an area that is too small can lead to incorrect data being returned: the smaller region may contain no holes while the correctly calculated region (based on the number of bytes readable from the decrypted buffer) does.

As an added precaution, this PR also makes sure not to read more than the calculated amount.
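
The fix described above can be sketched as follows. This is a minimal illustration, not the actual crypt4gh reader internals; the function and parameter names are hypothetical:

```go
package main

import "fmt"

// regionToCheck is a hypothetical sketch of the calculation described in
// this PR: the size of the region to check for edit-list holes should be
// the minimum of the destination slice length and the bytes still readable
// in the decrypted buffer. The bug used a "bytes already read" counter in
// place of the readable count, which could make the checked region too
// small.
func regionToCheck(destLen, readable int) int {
	if destLen < readable {
		return destLen
	}
	return readable
}

func main() {
	fmt.Println(regionToCheck(4096, 1024)) // 1024: limited by readable bytes
	fmt.Println(regionToCheck(512, 1024))  // 512: limited by destination slice
}
```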


codecov-commenter commented Dec 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (7c6f163) 66.24%, head (8ec0094) 66.29%.


Additional details and impacted files
@@            Coverage Diff             @@
##           master     #103      +/-   ##
==========================================
+ Coverage   66.24%   66.29%   +0.05%     
==========================================
  Files           6        6              
  Lines        1176     1178       +2     
==========================================
+ Hits          779      781       +2     
  Misses        278      278              
  Partials      119      119              
Flag Coverage Δ
unittests 66.29% <100.00%> (+0.05%) ⬆️




teemukataja commented Dec 21, 2023

Our issue still seems to persist: attempting to get a larger chunk returns less data than requested (I generated a 100 MB test file for this).

	dataEditListHeaderPacket := headers.DataEditListHeaderPacket{
		PacketType:    headers.PacketType{PacketType: headers.DataEditList},
		NumberLengths: 2,
		Lengths:       []uint64{10000000, 30000000},
	}

This is expected to return 20_000_000 bytes, but it returns 616832.

Running on v1.7.1 returns the 20MB chunk as expected.


pontus commented Dec 21, 2023

30 MB, I assume? Is this the exact code that gives that result, or is it something you can share? Similarly, what's the exact size in bytes of the file (can it be shared easily)?

Is this with a separate test program or through sda-download?

@teemukataja

The code is the unit test in this PR, but instead of sample.txt I used fallocate -l 100M 100mb.bin, which is 100_000_000 bytes.

This is how we used the data edit list to decrypt parts of the file between specific coordinates, and it worked like this up to version 1.7.1:

startCoordinate=10000000&endCoordinate=30000000

NumberLengths: 2
Lengths: []uint64{10000000, 30000000}

provided us with 20000000 bytes between those locations (from X to Y).

Are you saying the data edit list in and before that version was not working correctly? How would it then be used to read a section between bytes X and Y? And how does it work with your example with 3 lengths, Lengths: []uint64{0, 100, 300}?


pontus commented Dec 22, 2023

Yes, that sounds incorrect. Data edit list handling is described in section 4.2 of https://samtools.github.io/hts-specs/crypt4gh.pdf.

My "simple" interpretation of that is alternating values of how much to discard and how much to keep, with a missing final keep interpreted as "copy the rest until the end of the file".

Anyway, since those values are counters rather than file offsets, the data edit list above would discard the first 10000000 bytes, provide the next 30000000, and then nothing more, so you should get what lies between file offsets 10000000 and 40000000, yielding 30000000 bytes.

But looking at https://github.com/neicnordic/sensitive-data-archive/blob/main/sda-download/api/sda/sda.go#L262-L264, it seems this was done correctly in sda-download.
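
In other words, byte coordinates translate to alternating discard/keep counters. A hypothetical helper showing that mapping (the function name is illustrative, not sda-download's actual code):

```go
package main

import "fmt"

// coordinatesToLengths maps a requested byte range [start, end) to crypt4gh
// data edit list lengths: the first entry is how much to discard, the
// second how much to keep. Illustrative sketch only.
func coordinatesToLengths(start, end uint64) []uint64 {
	return []uint64{start, end - start}
}

func main() {
	// startCoordinate=10000000&endCoordinate=30000000 from the discussion:
	fmt.Println(coordinatesToLengths(10000000, 30000000)) // [10000000 20000000]
}
```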

I'll try experimenting a bit by running the unit tests with a larger sample file (right now I'm seeing a failure because of a non-failure, but that's in the writer, which hasn't been touched for a long time).

For extra clarity: an odd number of lengths in a data edit list means that the rest of the file will be provided after the last discard, so [0, 100, 300] means skip 0 bytes (i.e. copy from the start), transfer 100 bytes, skip 300, and copy the rest.
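
The alternating discard/keep semantics above can be sketched as a small standalone function. This is not the library's implementation, just an illustration of section 4.2 of the spec as interpreted here:

```go
package main

import "fmt"

// applyEditList applies crypt4gh data edit list semantics to an in-memory
// byte slice: lengths alternate discard (even index) and keep (odd index),
// and an odd number of lengths means the remainder after the final discard
// is kept. Sketch only; the real reader works on a decrypted stream.
func applyEditList(data []byte, lengths []uint64) []byte {
	var out []byte
	pos := uint64(0)
	for i, n := range lengths {
		if pos+n > uint64(len(data)) {
			n = uint64(len(data)) - pos
		}
		if i%2 == 1 { // odd index: keep this span
			out = append(out, data[pos:pos+n]...)
		}
		pos += n
	}
	if len(lengths)%2 == 1 { // odd count: keep everything after last discard
		out = append(out, data[pos:]...)
	}
	return out
}

func main() {
	data := []byte("0123456789")
	fmt.Printf("%s\n", applyEditList(data, []uint64{2, 3}))    // "234": skip 2, keep 3
	fmt.Printf("%s\n", applyEditList(data, []uint64{0, 4, 3})) // "0123789": keep 4, skip 3, keep rest
}
```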


pontus commented Dec 22, 2023

So, one of the tests had bad assumptions and has been fixed. There's also a test that (for better or worse) depends on the sample content being what it is in the repo, and it will fail with an empty file.

With the fixes in this branch, the tests work fine after extending test/sample.txt to half a gigabyte.

@teemukataja

Ok, it makes sense now, thank you! 😅

@blankdots blankdots merged commit 4449f2c into master Jan 2, 2024
6 checks passed
@blankdots blankdots deleted the del_bugfix branch January 2, 2024 09:24

blankdots commented Jan 2, 2024

I think we need to revert this; normal decryption does not work at all.

steps to reproduce:

go run main.go encrypt -f=sample.txt -p=key.pub.pem

go run main.go decrypt -f=sample.txt.c4gh -s=key.sec.pem
# ends up in a loop

@blankdots blankdots mentioned this pull request Jan 2, 2024
@teemukataja

I think we need to revert this; normal decryption does not work at all.

steps to reproduce:

go run main.go encrypt -f=sample.txt -p=key.pub.pem

go run main.go decrypt -f=sample.txt.c4gh -s=key.sec.pem
# ends up in a loop

confirmed, decryption hangs without producing anything


pontus commented Jan 2, 2024

Oops. Yes, I guess it's best to revert for now (I don't think I can look at it today).

I think this fix is good and needed but something else might also be required.

(Although I'm a bit surprised I didn't see this while testing.)
